{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## One Hot Encoding - variables with many categories\n",
    "\n",
    "We observed in the previous lecture that if a categorical variable contains multiple labels, then by re-encoding them using one hot encoding we will expand the feature space dramatically.\n",
    "\n",
    "See below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>X1</th>\n",
       "      <th>X2</th>\n",
       "      <th>X3</th>\n",
       "      <th>X4</th>\n",
       "      <th>X5</th>\n",
       "      <th>X6</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>v</td>\n",
       "      <td>at</td>\n",
       "      <td>a</td>\n",
       "      <td>d</td>\n",
       "      <td>u</td>\n",
       "      <td>j</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>t</td>\n",
       "      <td>av</td>\n",
       "      <td>e</td>\n",
       "      <td>d</td>\n",
       "      <td>y</td>\n",
       "      <td>l</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>w</td>\n",
       "      <td>n</td>\n",
       "      <td>c</td>\n",
       "      <td>d</td>\n",
       "      <td>x</td>\n",
       "      <td>j</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>t</td>\n",
       "      <td>n</td>\n",
       "      <td>f</td>\n",
       "      <td>d</td>\n",
       "      <td>x</td>\n",
       "      <td>l</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>v</td>\n",
       "      <td>n</td>\n",
       "      <td>f</td>\n",
       "      <td>d</td>\n",
       "      <td>h</td>\n",
       "      <td>d</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  X1  X2 X3 X4 X5 X6\n",
       "0  v  at  a  d  u  j\n",
       "1  t  av  e  d  y  l\n",
       "2  w   n  c  d  x  j\n",
       "3  t   n  f  d  x  l\n",
       "4  v   n  f  d  h  d"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "# let's load the mercedes benz dataset for demonstration, only the categorical variables\n",
    "\n",
    "data = pd.read_csv('mercedesbenz.csv', usecols=['X1', 'X2', 'X3', 'X4', 'X5', 'X6'])\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "X1 :  27  labels\n",
      "X2 :  44  labels\n",
      "X3 :  7  labels\n",
      "X4 :  4  labels\n",
      "X5 :  29  labels\n",
      "X6 :  12  labels\n"
     ]
    }
   ],
   "source": [
    "# let's have a look at how many labels each variable has\n",
    "\n",
    "for col in data.columns:\n",
    "    print(col, ': ', len(data[col].unique()), ' labels')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(4209, 117)"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# let's examine how many columns we will obtain after one hot encoding these variables\n",
    "pd.get_dummies(data, drop_first=True).shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that from just 6 initial categorical variables, we end up with 117 new variables. \n",
    "\n",
    "These numbers are still not huge, and in practice we could work with them relatively easily. However, in business datasets and also other Kaggle or KDD datasets, it is not unusual to find several categorical variables with multiple labels. And if we use one hot encoding on them, we will end up with datasets with thousands of columns.\n",
    "\n",
    "What can we do instead?\n",
    "\n",
    "In the winning solution of the KDD 2009 cup: \"Winning the KDD Cup Orange Challenge with Ensemble Selection\" (http://www.mtome.com/Publications/CiML/CiML-v3-book.pdf), the authors limit one hot encoding to the 10 most frequent labels of the variable. This means that they would make one binary variable for each of the 10 most frequent labels only. This is equivalent to grouping all the other labels under a new category, that in this case will be dropped. Thus, the 10 new dummy variables indicate if one of the 10 most frequent labels is present (1) or not (0) for a particular observation.\n",
    "\n",
    "How can we do that in python?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "as    1659\n",
       "ae     496\n",
       "ai     415\n",
       "m      367\n",
       "ak     265\n",
       "r      153\n",
       "n      137\n",
       "s       94\n",
       "f       87\n",
       "e       81\n",
       "Name: X2, dtype: int64"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# let's find the top 10 most frequent categories for the variable X2\n",
    "\n",
    "data.X2.value_counts().sort_values(ascending=False).head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['as', 'ae', 'ai', 'm', 'ak', 'r', 'n', 's', 'f', 'e']"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# let's make a list with the most frequent categories of the variable\n",
    "\n",
    "top_10 = [x for x in data.X2.value_counts().sort_values(ascending=False).head(10).index]\n",
    "top_10"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>X2</th>\n",
       "      <th>as</th>\n",
       "      <th>ae</th>\n",
       "      <th>ai</th>\n",
       "      <th>m</th>\n",
       "      <th>ak</th>\n",
       "      <th>r</th>\n",
       "      <th>n</th>\n",
       "      <th>s</th>\n",
       "      <th>f</th>\n",
       "      <th>e</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>at</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>av</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>n</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>n</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>n</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>e</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>e</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>as</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>as</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>aq</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   X2  as  ae  ai  m  ak  r  n  s  f  e\n",
       "0  at   0   0   0  0   0  0  0  0  0  0\n",
       "1  av   0   0   0  0   0  0  0  0  0  0\n",
       "2   n   0   0   0  0   0  0  1  0  0  0\n",
       "3   n   0   0   0  0   0  0  1  0  0  0\n",
       "4   n   0   0   0  0   0  0  1  0  0  0\n",
       "5   e   0   0   0  0   0  0  0  0  0  1\n",
       "6   e   0   0   0  0   0  0  0  0  0  1\n",
       "7  as   1   0   0  0   0  0  0  0  0  0\n",
       "8  as   1   0   0  0   0  0  0  0  0  0\n",
       "9  aq   0   0   0  0   0  0  0  0  0  0"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# and now we make the 10 binary variables\n",
    "\n",
    "for label in top_10:\n",
    "    data[label] = np.where(data['X2']==label, 1, 0)\n",
    "\n",
    "data[['X2']+top_10].head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>X1</th>\n",
       "      <th>X2</th>\n",
       "      <th>X3</th>\n",
       "      <th>X4</th>\n",
       "      <th>X5</th>\n",
       "      <th>X6</th>\n",
       "      <th>X2_as</th>\n",
       "      <th>X2_ae</th>\n",
       "      <th>X2_ai</th>\n",
       "      <th>X2_m</th>\n",
       "      <th>X2_ak</th>\n",
       "      <th>X2_r</th>\n",
       "      <th>X2_n</th>\n",
       "      <th>X2_s</th>\n",
       "      <th>X2_f</th>\n",
       "      <th>X2_e</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>v</td>\n",
       "      <td>at</td>\n",
       "      <td>a</td>\n",
       "      <td>d</td>\n",
       "      <td>u</td>\n",
       "      <td>j</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>t</td>\n",
       "      <td>av</td>\n",
       "      <td>e</td>\n",
       "      <td>d</td>\n",
       "      <td>y</td>\n",
       "      <td>l</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>w</td>\n",
       "      <td>n</td>\n",
       "      <td>c</td>\n",
       "      <td>d</td>\n",
       "      <td>x</td>\n",
       "      <td>j</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>t</td>\n",
       "      <td>n</td>\n",
       "      <td>f</td>\n",
       "      <td>d</td>\n",
       "      <td>x</td>\n",
       "      <td>l</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>v</td>\n",
       "      <td>n</td>\n",
       "      <td>f</td>\n",
       "      <td>d</td>\n",
       "      <td>h</td>\n",
       "      <td>d</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  X1  X2 X3 X4 X5 X6  X2_as  X2_ae  X2_ai  X2_m  X2_ak  X2_r  X2_n  X2_s  \\\n",
       "0  v  at  a  d  u  j      0      0      0     0      0     0     0     0   \n",
       "1  t  av  e  d  y  l      0      0      0     0      0     0     0     0   \n",
       "2  w   n  c  d  x  j      0      0      0     0      0     0     1     0   \n",
       "3  t   n  f  d  x  l      0      0      0     0      0     0     1     0   \n",
       "4  v   n  f  d  h  d      0      0      0     0      0     0     1     0   \n",
       "\n",
       "   X2_f  X2_e  \n",
       "0     0     0  \n",
       "1     0     0  \n",
       "2     0     0  \n",
       "3     0     0  \n",
       "4     0     0  "
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# get whole set of dummy variables, for all the categorical variables\n",
    "\n",
    "def one_hot_top_x(df, variable, top_x_labels):\n",
    "    # function to create the dummy variables for the most frequent labels\n",
    "    # we can vary the number of most frequent labels that we encode\n",
    "    \n",
    "    for label in top_x_labels:\n",
    "        df[variable+'_'+label] = np.where(data[variable]==label, 1, 0)\n",
    "\n",
    "# read the data again\n",
    "data = pd.read_csv('mercedesbenz.csv', usecols=['X1', 'X2', 'X3', 'X4', 'X5', 'X6'])\n",
    "\n",
    "# encode X2 into the 10 most frequent categories\n",
    "one_hot_top_x(data, 'X2', top_10)\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>X1</th>\n",
       "      <th>X2</th>\n",
       "      <th>X3</th>\n",
       "      <th>X4</th>\n",
       "      <th>X5</th>\n",
       "      <th>X6</th>\n",
       "      <th>X2_as</th>\n",
       "      <th>X2_ae</th>\n",
       "      <th>X2_ai</th>\n",
       "      <th>X2_m</th>\n",
       "      <th>...</th>\n",
       "      <th>X1_aa</th>\n",
       "      <th>X1_s</th>\n",
       "      <th>X1_b</th>\n",
       "      <th>X1_l</th>\n",
       "      <th>X1_v</th>\n",
       "      <th>X1_r</th>\n",
       "      <th>X1_i</th>\n",
       "      <th>X1_a</th>\n",
       "      <th>X1_c</th>\n",
       "      <th>X1_o</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>v</td>\n",
       "      <td>at</td>\n",
       "      <td>a</td>\n",
       "      <td>d</td>\n",
       "      <td>u</td>\n",
       "      <td>j</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>t</td>\n",
       "      <td>av</td>\n",
       "      <td>e</td>\n",
       "      <td>d</td>\n",
       "      <td>y</td>\n",
       "      <td>l</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>w</td>\n",
       "      <td>n</td>\n",
       "      <td>c</td>\n",
       "      <td>d</td>\n",
       "      <td>x</td>\n",
       "      <td>j</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>t</td>\n",
       "      <td>n</td>\n",
       "      <td>f</td>\n",
       "      <td>d</td>\n",
       "      <td>x</td>\n",
       "      <td>l</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>v</td>\n",
       "      <td>n</td>\n",
       "      <td>f</td>\n",
       "      <td>d</td>\n",
       "      <td>h</td>\n",
       "      <td>d</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 26 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "  X1  X2 X3 X4 X5 X6  X2_as  X2_ae  X2_ai  X2_m  ...   X1_aa  X1_s  X1_b  \\\n",
       "0  v  at  a  d  u  j      0      0      0     0  ...       0     0     0   \n",
       "1  t  av  e  d  y  l      0      0      0     0  ...       0     0     0   \n",
       "2  w   n  c  d  x  j      0      0      0     0  ...       0     0     0   \n",
       "3  t   n  f  d  x  l      0      0      0     0  ...       0     0     0   \n",
       "4  v   n  f  d  h  d      0      0      0     0  ...       0     0     0   \n",
       "\n",
       "   X1_l  X1_v  X1_r  X1_i  X1_a  X1_c  X1_o  \n",
       "0     0     1     0     0     0     0     0  \n",
       "1     0     0     0     0     0     0     0  \n",
       "2     0     0     0     0     0     0     0  \n",
       "3     0     0     0     0     0     0     0  \n",
       "4     0     1     0     0     0     0     0  \n",
       "\n",
       "[5 rows x 26 columns]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# find the 10 most frequent categories for X1\n",
    "top_10 = [x for x in data.X1.value_counts().sort_values(ascending=False).head(10).index]\n",
    "\n",
    "# now create the 10 most frequent dummy variables for X1\n",
    "one_hot_top_x(data, 'X1', top_10)\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### One Hot encoding of top variables\n",
    "\n",
    "### Advantages\n",
    "\n",
    "- Straightforward to implement\n",
    "- Does not require hrs of variable exploration\n",
    "- Does not expand massively the feature space (number of columns in the dataset)\n",
    "\n",
    "### Disadvantages\n",
    "\n",
    "- Does not add any information that may make the variable more predictive\n",
    "- Does not keep the information of the ignored labels\n",
    "\n",
    "\n",
    "Because it is not unusual that categorical variables have a few dominating categories and the remaining labels add mostly noise, this is a quite simple and straightforward approach that may be useful on many occasions.\n",
    "\n",
    "It is worth noting that the top 10 variables is a totally arbitrary number. You could also choose the top 5, or top 20.\n",
    "\n",
    "This modelling was more than enough for the team to win the KDD 2009 cup. They did do some other powerful feature engineering as we will see in following lectures, that improved the performance of the variables dramatically."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  },
  "toc": {
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": "block",
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
